Name:Qingcheng Wei

Collaborated with:

This homework is due on Feb. 27th at 11:55 pm.

Instructions: fill in your answers in the .Rmd file, compile it to HTML/PDF, and submit the compiled file. Uncompiled .Rmd files will not be graded.

Remark. This homework aims to help you understand PCA and its applications.

1 NBA Dataset

Import the dataset from “nba-teams-2017.csv”. Create a new data.frame that contains the following columns: team, wins, points, points3, free_throws, off_rebounds, def_rebounds, assists, steals, and personal_fouls.

1.0.1 (a) (5 pt) Create box plots of the quantitative features (i.e., all columns but team) to see whether you should scale the data when performing PCA. Describe your findings in words.

suppressPackageStartupMessages(library(dplyr))
data <- read.csv("nba-teams-2017.csv")
col <- data.frame(data[, c("team", "wins", "points", "points3", "free_throws",
                           "off_rebounds", "def_rebounds", "assists", "steals",
                           "personal_fouls")])
boxplot(col %>% select(-team))

We should scale the data, as the variance of wins is much larger than that of the other features.
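The same point can be checked numerically by comparing column variances. A minimal sketch, illustrated on the built-in mtcars data so it runs without the CSV; the same call applies to `col %>% select(-team)`:

```r
# Sketch: if column variances differ by orders of magnitude, scale before PCA.
# Shown on the built-in mtcars data; apply the same check to the NBA columns.
vars <- sapply(mtcars, var)
round(sort(vars, decreasing = TRUE), 1)
# disp dominates the rest, so prcomp(mtcars, scale = TRUE) is the safe choice.
```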


1.0.2 (b) (5 pt) Obtain the PC loadings of the first four principal components (PCs). Only display the first few elements of each loading in your report.

pca_bball <- prcomp(col %>% select(-team), scale = TRUE)
pca_bball$rotation[,1:4]
##                        PC1         PC2         PC3         PC4
## wins           -0.42366308  0.07555818 -0.11681772  0.21657745
## points         -0.50246947 -0.21708761  0.19677723 -0.07946391
## points3        -0.41664778  0.17268215 -0.08316881 -0.51204012
## free_throws    -0.24452950 -0.41519726  0.30946852 -0.33272209
## off_rebounds    0.08111297 -0.39160091  0.47926467  0.46463787
## def_rebounds   -0.26053718  0.26461371  0.57662166  0.15459073
## assists        -0.45236958  0.05322237 -0.26482231  0.27220000
## steals         -0.20525546 -0.41895791 -0.45168498  0.37698249
## personal_fouls  0.11583180 -0.58585494 -0.09277492 -0.34335891

1.0.3 (c) (5 pt) Generate a scree plot describing the amount explained by the various PCs.

screeplot(pca_bball, type = "l", npcs = 9, main = "Scree plot of the PCs")
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = "Eigenvalue = 1", col = "red", lty = 5, cex = 0.6)


1.0.4 (d) (5 pt) Make another plot showing the cumulative percent of the variance explained.

Precisely: for each \(1\leq k \leq 9\) you are plotting:

\[\frac{\sum_{j=1}^k d_{j}^2}{\sum_{j=1}^{9} d_j^2}. \]

cumpro <- cumsum(pca_bball$sdev^2 / sum(pca_bball$sdev^2))
plot(cumpro, type = "b", xlab = "PC #", ylab = "Cumulative proportion of explained variance", main = "Cumulative variance plot")
abline(h = 0.9, col = "blue", lty = 5)
legend("topleft", legend = "Cut-off @ 90%", col = "blue", lty = 5, cex = 0.6)


1.0.5 (e) (5 pt) If you were to retain all PCs which explain at least 90% of the variance, how many PCs would you retain?

By looking at the plot, to retain at least 90% of the variance we would keep 5 PCs.
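The count can also be read off programmatically rather than from the plot. A hedged sketch on the built-in USArrests data so it runs standalone; swapping in pca_bball gives the count for the NBA data:

```r
# Sketch: first index at which the cumulative proportion of variance
# explained reaches 90%. Demonstrated on USArrests; swap in pca_bball.
pc <- prcomp(USArrests, scale = TRUE)
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
which(cum_var >= 0.9)[1]
```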


1.0.6 (f) (10 pt) Plot PC1 vs PC2 with the team names and try to interpret your findings.

pca_scores <- pca_bball$x # PCs/ PC scores 

low_dim_rep <- pca_scores %>% 
  data.frame() %>% 
  mutate(team = col$team) %>% 
  select(team, everything())

library(ggplot2)
ggplot(low_dim_rep, aes(x = PC1, y = PC2)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(aes(label = team), size = 2) +
  coord_cartesian(xlim = c(-8, 5)) +
  theme_linedraw()

Teams that are close to each other in the plot are similar in these statistics, while teams that are far apart differ in performance. For example, the Phoenix Suns sit far from the rest of the teams, whereas the Golden State Warriors, who did extremely well in 2017, also sit away from the center of the plot.


2 RedfinHouse Image

Import the image from “Redfin_house.png”. Let \(X\) be the pixel intensity associated with the red color in the image.

2.0.1 (a) (5 pt) What are the dimensions of \(X\)? Plot a histogram of the pixel intensities within the image.

library(png, quietly = TRUE)
library(grid, quietly = TRUE)
directory <- "Redfin_house.png"
img <- readPNG(directory)
img_plot <- as.raster(img)
grid.raster(img_plot)

img_red_only <- img[, , 1]  # X: red-channel pixel intensities
dim(img_red_only)
## [1] 505 505
img.out <- prcomp(img_red_only, scale = TRUE)
hist(img_red_only,
     main = "Pixel intensities of the red channel",
     xlab = "Intensity",
     col = "blue")


2.0.2 (b) (10 pt) Now let’s do PCA for the row vectors in \(X\). Plot the scree plots for this data, which illustrate the percentage variation explained against the number of principal components and the cumulative percentage variation explained against the number of principal components. How many PCs are needed to explain 90% of the total variation of \(X\)?

pr.var <- img.out$sdev^2
pve <- pr.var / sum(pr.var)
par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained")
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b", xlim = c(1, 50))

cum_explained1 <- cumsum(pve)
cat("Number of PCs needed to explain at least 90% of the variation:",
    which(cum_explained1 >= 0.9)[1])
## Number of PCs needed to explain at least 90% of the variation: 24
cumsum(pve)[1:28]
##  [1] 0.2962951 0.4951405 0.5675534 0.6340435 0.6792437 0.7156087 0.7430687
##  [8] 0.7665766 0.7861470 0.8038898 0.8202550 0.8311917 0.8407286 0.8497609
## [15] 0.8572818 0.8645586 0.8705091 0.8756694 0.8806781 0.8853024 0.8895810
## [22] 0.8934217 0.8970729 0.9006134 0.9038637 0.9069463 0.9098661 0.9126925

2.0.3 (c) (10 pt) For \(d = 1, 5, 10, 15, 20, 30, 50, 100, 200\) project the image onto the first \(d\) principal components and plot the resulting compressed image for each \(d\). For each of the nine plots, include the cumulative percentage variation explained by the projection in the title of your plots.

pr.out <- img.out
W <- pr.out$rotation  # the loading matrix
pc.image <- list()
num.pcs <- c(1, 5, 10, 15, 20, 30, 50, 100, 200)

# scale the original image
Image <- scale(img_red_only)
for (j in 1:length(num.pcs)) {
  u.proj <- W
  # use only the first num.pcs[j] PC loadings, so zero out the rest
  u.proj[, (num.pcs[j] + 1):ncol(W)] <- 0

  # project onto the retained PCs and map back to pixel space
  projection <- (Image %*% u.proj) %*% t(u.proj)

  # to draw an image, values need to be between 0 and 1
  scaled <- projection - min(as.numeric(projection))
  scaled <- scaled / max(as.numeric(scaled))
  pc.image[[j]] <- as.raster(scaled)
}
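The loop above zeroes out the trailing loadings; equivalently, the rank-\(d\) reconstruction is \(X W_d W_d^\top\) for the first \(d\) loading columns \(W_d\). A small self-contained sketch on a random toy matrix (the names X and W_d are illustrative, not from the assignment):

```r
# Sketch: rank-d PCA reconstruction of a centered matrix as X %*% W_d %*% t(W_d).
set.seed(1)
X <- scale(matrix(rnorm(20 * 10), nrow = 20))  # toy 20 x 10 data, standardized
pc <- prcomp(X, center = FALSE)                # already centered and scaled
d <- 2
W_d <- pc$rotation[, 1:d]                      # first d loading columns
X_d <- (X %*% W_d) %*% t(W_d)                  # rank-d approximation of X
# Reconstruction error shrinks as d grows; with d = ncol(X) it is ~0.
sum((X - X_d)^2)
```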

# plot each of the images
grid.raster(pc.image[[1]])

cumsum(pve)[1]
## [1] 0.2962951
grid.raster(pc.image[[2]])

cumsum(pve)[5]
## [1] 0.6792437
grid.raster(pc.image[[3]])

cumsum(pve)[10]
## [1] 0.8038898
grid.raster(pc.image[[4]])

cumsum(pve)[15]
## [1] 0.8572818
grid.raster(pc.image[[5]])

cumsum(pve)[20]
## [1] 0.8853024
grid.raster(pc.image[[6]])

cumsum(pve)[30]
## [1] 0.9177672
grid.raster(pc.image[[7]])

cumsum(pve)[50]
## [1] 0.9497718
grid.raster(pc.image[[8]])

cumsum(pve)[100]
## [1] 0.980463
grid.raster(pc.image[[9]])

cumsum(pve)[200]
## [1] 0.9959532
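The prompt also asks for the cumulative percentage in each plot's title; grid.raster itself has no title argument, but grid.text can place one. A sketch, with example values hard-coded from part (b) so it stands alone; in the report, loop over pc.image and num.pcs instead:

```r
library(grid)
d <- 5
cum_pve <- 0.6792437                 # cumsum(pve)[5] from part (b)
grid.newpage()
# draw the compressed image slightly lower to leave room for a title, e.g.
# grid.raster(pc.image[[2]], y = 0.45, height = 0.85)
grid.text(sprintf("d = %d (%.1f%% of variance explained)", d, 100 * cum_pve),
          y = unit(0.95, "npc"))
```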